A co-training framework for searching XML documents

Authors

  • Wilfred Ng
  • Ho Lam Lau
Abstract

In this paper, we study the use of XML tagged keywords (or simply key-tags) to search for XML fragments in a collection of XML documents. We present techniques that use users' evaluations as feedback to generate an adaptive ranked list of XML fragments as the search results. First, we extend the vector space model as a basis for searching XML fragments. The model examines the relevance between the imposed key-tags and the fragments identified in XML documents, and outputs a ranked result. Second, to deal with the diversified nature of XML documents, we present four XML Rankers (XRs), which have different strengths in terms of similarity, granularity, and ranking features. The XRs are specially tailored to diversified XML documents. We then evaluate the XML search effectiveness and quality of each tailored XR and propose a Meta-XML Ranker (MXR) comprising the four XRs. The MXR is trained via a machine learning scheme that we term the Ranking Support Vector Machine (RSVM) in a Co-training Framework (RSCF). The RSCF takes as input two sets of labelled fragments and feature vectors and generates adaptive rankers as output in a learning process. We show empirically that, with only a small set of training XML fragments, the RSCF is able to improve after a few iterations of the learning process. Finally, we demonstrate that the RSCF-based MXR is able to bring out the strengths of the underlying XRs in order to adapt to the users' perspectives on the returned search results. Using a set of key-tag queries on a variety of XML documents, we show that the RSCF-based MXR is effective in terms of result precision.
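As a rough illustration of the first step, the sketch below ranks XML fragments against key-tags with a plain vector space model. The abstract does not spell out the paper's extended model, so this is only an assumption: each fragment and each query is treated as a bag of (tag, word) pairs and scored by cosine similarity. The names `key_tag_terms`, `cosine`, and `rank_fragments`, and the sample document, are hypothetical and not taken from the paper.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def key_tag_terms(fragment):
    """Count (tag, word) pairs for every element inside a fragment."""
    terms = Counter()
    for element in fragment.iter():
        for word in (element.text or "").lower().split():
            terms[(element.tag, word)] += 1
    return terms


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_fragments(xml_text, fragment_tag, query):
    """Score each <fragment_tag> element against (tag, keyword) pairs.

    Returns (score, fragment_index) tuples, best match first.
    This is plain cosine ranking, an assumed stand-in for the paper's model.
    """
    root = ET.fromstring(xml_text)
    query_vector = Counter((tag, word.lower()) for tag, word in query)
    scored = [
        (cosine(query_vector, key_tag_terms(fragment)), index)
        for index, fragment in enumerate(root.iter(fragment_tag))
    ]
    return sorted(scored, key=lambda pair: -pair[0])


# Hypothetical sample collection: two <book> fragments.
DOC = """
<library>
  <book><title>XML search</title><author>Ng</author></book>
  <book><title>Database systems</title><author>Lau</author></book>
</library>
"""

# Key-tags: the keyword "xml" under <title> and "ng" under <author>.
ranking = rank_fragments(DOC, "book", [("title", "xml"), ("author", "ng")])
for score, index in ranking:
    print(f"fragment {index}: score {score:.3f}")
```

Pairing each word with its enclosing tag means a keyword only matches when it appears under the requested tag, which is the basic idea behind a key-tag query. The rankers, feedback, and RSVM training described in the abstract are not modelled here.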

Similar resources

Metaheuristic clustering of Persian XML documents based on structural and content similarity

Given the growing number of XML documents, organizing them effectively is essential for retrieving useful information from them. One possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...

XML Document Classification with Co-training

This paper presents an algorithm that uses Co-training with the precision/recall-driven decision-tree algorithm to handle the labeled-unlabeled problem of XML classification. The two views are generated using a predicate rewrite system mechanism which is built on a higher-order logic representation formalism. Experimental results show that this method performs well on classifying XML documents usi...

Similarity Metric for XML Documents

Since XML documents can be represented as trees, this paper builds on traditional tree edit distance to present a structural similarity metric for XML documents, which is based on edge constraints, path constraints, and inclusive path constraints, together with a similarity metric based on machine learning with node costs. It extends the scope for searching XML documents, and improves recall and precision for searchi...

Prototyping a Vibrato-Aware Query-By-Humming (QBH) Music Information Retrieval System for Mobile Communication Devices: Case of Chromatic Harmonica

Background and Aim: The current research aims at prototyping query-by-humming music information retrieval systems for smart phones. Methods: This multi-method research follows simulation technique from mixed models of the operations research methodology, and the documentary research method, simultaneously. Two chromatic harmonica albums comprised the research population. To achieve the purpose ...

Searching Multi-hierarchical XML Documents: The Case of Fragmentation

To properly encode properties of textual documents using XML, multiple markup hierarchies must be used, often leading to conflicting markup in encodings. Text Encoding Initiative (TEI) Guidelines[1] recognize this problem and suggest a number of ways to incorporate multiple hierarchies in a single well-formed XML document. In this paper, we present a framework for processing XPath queries o...

Journal:
  • Inf. Syst.

Volume 32

Publication date: 2007